11. T-tests

Johnny van Doorn

University of Amsterdam

2025-09-30

Block 2: Life is Mean

  • Comparing 2 means (t-test)
  • Comparing 2 or more means (ANOVA)
  • Between subjects vs. Within subjects
  • Adding predictors

In this lecture we aim to:

  • Introduce the t-test for comparing means
    • One mean
    • Two means, within subjects
    • Two means, between subjects
  • What numbers can we look at? Where do they come from?

Reading: Chapter 9

One-sample t-test

Compare 1 group mean to a hypothesized value

Measuring IQ

Models

\[\text{outcome} = \text{model} + \text{error}\]

\(\mathcal{H}_0: \text{model} = 120\)

\(\mathcal{H}_A: \text{model} = \bar{x}\)

Hypotheses

Null hypothesis

  • \(H_0: \mu = 120\)

Alternative hypotheses

  • \(H_A: \mu \neq 120\)
  • \(H_A: \mu > 120\)
  • \(H_A: \mu < 120\)

Can you phrase research questions that would lead you to each of these three versions of \(\mathcal{H}_A\)?

Compare sample mean to 120

We use the one-sample t-test to compare the sample mean \(\bar{x}\) to the mean that is hypothesized by \(\mathcal{H}_0\): \(\mu = 120\). Let’s take a look at our sample:

mu     <- 120
n      <- length(IQ.next.to.you)
x      <- IQ.next.to.you
mean_x <- mean(x, na.rm = TRUE)
sd_x   <- sd(x, na.rm = TRUE)
cbind(n, mean_x, sd_x)
      n   mean_x     sd_x
[1,] 96 117.8854 19.73288

Does this mean differ significantly from \(\mathcal{H}_0:\) \(\mu = 120\)?

Assumptions

  • Normally distributed residuals (i.e., model errors)
  • Random samples

T-statistic

\[T_{n-1} = \frac{\bar{x}-\mu}{SE_x} = \frac{\bar{x}-\mu}{s_x / \sqrt{n}} = \frac{117.89 - 120 }{19.73 / \sqrt{96}}\]

So the t-statistic expresses the deviation of the sample mean \(\bar{x}\) from the hypothesized population mean \(\mu\) in units of the standard error, which shrinks as the sample size grows.

tStat <- (mean_x - mu) / (sd_x / sqrt(n)); tStat
[1] -1.049953

Type I error

To determine whether this t-value is significant, we first have to specify the Type I error rate that we are willing to accept.

  • Type I error / \(\alpha\) = .01

P-value two sided

Finally, we calculate our \(p\)-value, for which we need the degrees of freedom \(df = n - 1\) to determine the shape of the t-distribution.

\[ \mathcal{H}_A: \mu \neq 120 \rightarrow t \neq 0 \]
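Given \(df = n - 1 = 95\), the two-sided \(p\)-value follows directly from the t-distribution. A minimal sketch in R, plugging in the t-statistic computed above:

```r
tStat <- -1.049953  # one-sample t-statistic from above
df    <- 96 - 1     # degrees of freedom, n - 1

# Two-sided p-value: probability of a t-value at least this extreme under H0
p <- 2 * pt(-abs(tStat), df)
p
```

Since this \(p\)-value is far above \(\alpha = .01\), we do not reject \(\mathcal{H}_0\).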

P-value one sided

\[ \mathcal{H}_A: \mu > 120 \rightarrow t > 0 \]

P-value one sided

\[ \mathcal{H}_A: \mu < 120 \rightarrow t < 0 \]
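Both one-sided \(p\)-values use a single tail of the same t-distribution; a sketch reusing the t-statistic from above:

```r
tStat <- -1.049953  # one-sample t-statistic from above
df    <- 95

# H_A: mu > 120 -> area in the upper tail
p_greater <- pt(tStat, df, lower.tail = FALSE)

# H_A: mu < 120 -> area in the lower tail
p_less <- pt(tStat, df)

c(greater = p_greater, less = p_less)
```

Note that \(p_{\text{less}}\) is half the two-sided \(p\)-value here, because the observed mean lies below 120.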

Effect-size Cohen’s \(d\)

  • The \(t\)-statistic can be significant because of a big effect and/or a large sample size
  • The effect size Cohen’s \(d\) is unaffected by sample size

\[d = \frac{t}{\sqrt{n}}\]

d <- tStat / sqrt(n)

d
[1] -0.1071604


Paired-samples t-test

Compare 2 dependent/paired group means

Paired-samples t-test

In the paired-samples t-test, the difference (\(D\)) within each pair is calculated, and the mean of these differences (\(\bar{D}\)) is tested against the null hypothesis that \(\mu_D = 0\).

\[t_{n-1} = \frac{\bar{D} - \mu}{ {SE}_D }\] Where the degrees of freedom are \(df = n - 1\) (with \(n\) the number of pairs) and \(SE_D\) is the standard error of \(D\), defined as \(s_D/\sqrt{n}\).

Hypothesis

\[\LARGE{ \begin{aligned} H_0 &: \mu_D = 0 \\ H_A &: \mu_D \neq 0 \\ H_A &: \mu_D > 0 \\ H_A &: \mu_D < 0 \\ \end{aligned}}\]

Assumptions

  • Normally distributed residuals (i.e., model errors)
  • Random samples

Wide data structure

index  k1  k2
1      x   x
2      x   x
3      x   x
4      x   x

Where \(k\) is the level of the categorical predictor variable and \(x\) is the value of the outcome/dependent variable.

Long vs. wide format

Data example

We are going to use the IQ estimates we collected. You had to guess your neighbor’s IQ and your own IQ.

Let’s take a look at the data.

IQ estimates

Calculate \(D\)

diffScores <- IQ.next.to.you - IQ.you

Calculate \(\bar{D}\)

diffScores      <- na.omit(diffScores) # get rid of all missing values
diffMean        <- mean(diffScores)
diffMean
[1] -2.395833

And we also need n.

n <- length(diffScores)
n
[1] 96

Calculate t-value

\[t_{n-1} = \frac{\bar{D} - \mu}{ {SE}_D }\]

mu <- 0                # Define mu

diffSD <- sd(diffScores)   # Calculate standard deviation
diffSE <- diffSD / sqrt(n) # Calculate standard error

df   <- n - 1          # Calculate degrees of freedom

# Calculate t
tStat <- ( diffMean - mu ) / diffSE
tStat
[1] -0.8280205

Test for significance

\[ \mathcal{H}_A: \mu_D \neq 0 \rightarrow t \neq 0 \]
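As with the one-sample test, the two-sided \(p\)-value comes from the t-distribution with \(df = n - 1 = 95\); a sketch using the t-statistic computed above:

```r
tStat <- -0.8280205  # paired-samples t-statistic from above
df    <- 96 - 1      # degrees of freedom, n - 1

# Two-sided p-value under H0: mu_D = 0
p <- 2 * pt(-abs(tStat), df)
p
```

This \(p\)-value is well above any conventional \(\alpha\), so we do not reject \(\mathcal{H}_0\).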

Effect-size \(d\)

\[d = \frac{t}{\sqrt{n}}\]

d <- tStat/(sqrt(n))
d
[1] -0.08450949

Independent-samples t-test

Compare 2 independent group means

Independent-samples t-test

In the independent-samples t-test, the means of the two independent samples are calculated and the difference between these means \((\bar{X}_1 - \bar{X}_2)\) is tested against the null hypothesis that \(\mu_1 - \mu_2 = 0\).

\[t_{n_1 + n_2 -2} = \frac{(\bar{X}_1 - \bar{X}_2) - \mu}{{SE}_p}\] Where \(n_1\) and \(n_2\) are the number of cases in each group and \(SE_p\) is the pooled standard error.

Hypothesis

\[\begin{aligned} H_0 &: \mu_1 = \mu_2 \\ H_A &: \mu_1 \neq \mu_2 \\ H_A &: \mu_1 > \mu_2 \\ H_A &: \mu_1 < \mu_2 \\ \end{aligned}\]

Long data structure

index  k  outcome
1      1  x
2      1  x
3      2  x
4      2  x

Where \(k\) is the level of the categorical predictor variable and \(x\) is the value of the outcome/dependent variable.

Assumptions

  • Normally distributed residuals (i.e., model errors)
  • Random samples

Specific to the independent-samples \(t\)-test:

  • Equality of variance
    • Assess with the observed SDs or Levene’s test
    • If violated, use the Welch \(t\)-test (robust to unequal variances)

Example

We are going to use the IQ estimates we collected. You had to guess the IQ of the person sitting next to you and your own IQ. Do your guesses differ from the guesses from last year?

The data

Calculate means

iq2024.mean <- mean(iq2024, na.rm = TRUE)
iq2025.mean <- mean(iq2025, na.rm = TRUE)

rbind(iq2024.mean, iq2025.mean)
                [,1]
iq2024.mean 116.8841
iq2025.mean 117.8854

Calculate variance

iq2024.var   <- var(iq2024,   na.rm = TRUE)
iq2025.var <- var(iq2025, na.rm = TRUE)
print(rbind(iq2024.var, iq2025.var))
               [,1]
iq2024.var 259.2805
iq2025.var 389.3867
iq2024.n   <- length(iq2024)
iq2025.n <- length(iq2025)
n <- iq2024.n + iq2025.n
print(rbind(iq2024.n, iq2025.n))
         [,1]
iq2024.n   69
iq2025.n   96

Calculate t-value

\[t_{n_1 + n_2 -2} = \frac{(\bar{X}_1 - \bar{X}_2) - \mu}{{SE}_p}\]

Where \({SE}_p\) is the pooled standard error.

\[{SE}_p = \sqrt{\frac{S^2_p}{n_1}+\frac{S^2_p}{n_2}}\]

And \(S^2_p\) is the pooled variance.

\[S^2_p = \frac{(n_1-1)s^2_1+(n_2-1)s^2_2}{n_1+n_2-2}\]

Where \(s^2\) is the variance and \(n\) the sample size.

Calculate pooled variance

\[S^2_p = \frac{(n_1-1)s^2_1+(n_2-1)s^2_2}{n_1+n_2-2}\]

df <- iq2024.n + iq2025.n - 2
pooledVar <- ( (iq2024.n-1)*iq2024.var + (iq2025.n-1)*iq2025.var ) / df

df
[1] 163
pooledVar
[1] 335.1093

Calculate pooled SE

\[ {SE}_p = \sqrt{\frac{S^2_p}{n_1}+\frac{S^2_p}{n_2}} \]

sePooled <- sqrt( ((pooledVar/iq2024.n) + (pooledVar/iq2025.n)) )
sePooled
[1] 2.889183

Calculate t-value

\[t_{n_1 + n_2 -2} = \frac{(\bar{X}_1 - \bar{X}_2) - \mu}{{SE}_p}\]

tStat <- ( iq2024.mean - iq2025.mean ) / sePooled

tStat
[1] -0.3465889

Test for significance

\[ \mathcal{H}_A: \mu_1 \neq \mu_2 \rightarrow t \neq 0 \]
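Here the two-sided \(p\)-value uses \(df = n_1 + n_2 - 2 = 163\); a sketch with the t-statistic computed above:

```r
tStat <- -0.3465889  # independent-samples t-statistic from above
df    <- 69 + 96 - 2 # degrees of freedom, n1 + n2 - 2

# Two-sided p-value under H0: mu_1 = mu_2
p <- 2 * pt(-abs(tStat), df)
p
```

This \(p\)-value is far from significant: the 2024 and 2025 guesses do not differ detectably.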

Effect-size d

\[d = \frac{2t}{\sqrt{n_1 + n_2}}\]

d <- 2*tStat / sqrt(n)

d
[1] -0.05396382

But what about equal variances?

Several hypothesis tests exist for this; the most commonly used is Levene’s test:

Levene's Test for Homogeneity of Variance (center = median)
       Df F value Pr(>F)
group   1  0.0986  0.754
      163               

But more nuance lies in comparing the observed SDs or variances:

    2024     2025 
16.10219 19.73288 

But what about equal variances?

Warning

Levene’s test (and other significance tests like it, such as Shapiro–Wilk for normality) is heavily influenced by sample size, so a significant test result does not necessarily mean that you have a problem. A more pragmatic rule of thumb is to look at the ratio of variances: a ratio greater than 2 is problematic. Additionally, the Welch \(t\)-test is a version of the \(t\)-test that is robust to unequal variances.
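The variance-ratio rule of thumb can be checked directly with the variances computed earlier:

```r
iq2024.var <- 259.2805  # variances from the slides above
iq2025.var <- 389.3867

# Ratio of the larger to the smaller variance; > 2 would be problematic
varRatio <- max(iq2024.var, iq2025.var) / min(iq2024.var, iq2025.var)
varRatio
```

At roughly 1.5, the ratio stays below 2, in line with the non-significant Levene’s test.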

Welch \(t\)-test is more robust

Unequal variances bias the sampling distribution of \(t\). The Welch test makes two corrections:

  • It uses the unpooled SE
  • It lowers the df
    • More inequality = more reduction

\[SE_{\text{unpooled}} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\]
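The reduced degrees of freedom come from the Welch–Satterthwaite approximation (not shown on the slide); the more unequal the variances and sample sizes, the further \(df\) drops below \(n_1 + n_2 - 2\):

\[df_{\text{Welch}} = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \frac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}\]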

Normality

  • Assess using a plot (Q-Q plot)
    • Points should be along the diagonal
    • Not exact
  • Test using hypothesis test (Shapiro-Wilk)
    • \(p < \alpha \rightarrow\) assumption violated
    • Same caution as Levene’s test (see also Jane Superbrain Box 6.7)
    • Plus caution as with any p-value (black-and-white decision making)
  • Becomes less important as \(n\) grows (\(n \approx 30\))
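A minimal sketch of both checks in R, run on simulated IQ guesses (the data here are invented purely for illustration):

```r
set.seed(1)
x <- rnorm(96, mean = 118, sd = 20)  # simulated scores, illustration only

# Visual check: points should lie along the diagonal
qqnorm(x)
qqline(x)

# Formal check: Shapiro-Wilk test; p < alpha suggests a violation
res <- shapiro.test(x)
res$p.value
```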

Handling Assumption violations

Warning

Assumption violations affect the shape of the sampling distribution and distort the Type I/II error rates

  • Unequal variances?
    • Welch \(t\)-test
  • Non-normality?
    • Use a nonparametric test (Chapter 15)
    • Bootstrapping (Section 6.10.5; bonus)

Closing

Contact

CC BY-NC-SA 4.0